5000 Fastest Growing Private Companies in U.S. by Anna Li

Summary statistics

Preliminary look at the Inc. 5000 Company List data set.

List of all variables in the dataframe

##  [1] "row_num"     "id"          "rank"        "workers"     "company"    
##  [6] "url"         "state_l"     "state_s"     "city"        "metro"      
## [11] "growth"      "revenue"     "industry"    "yrs_on_list"

Dimensions of the dataframe

## [1] 5000   14

Structure of dataframe with preview of data values

## 'data.frame':    5000 obs. of  14 variables:
##  $ row_num    : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ id         : int  22890 25747 25643 26098 26182 22913 22937 25413 26079 25861 ...
##  $ rank       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ workers    : int  227 191 145 62 92 50 129 130 264 11 ...
##  $ company    : Factor w/ 5000 levels "(add)ventures",..: 1725 3569 3651 4211 79 3520 1094 3357 4703 1826 ...
##  $ url        : Factor w/ 5000 levels "@properties",..: 1725 3569 3647 4211 76 3520 1094 3357 4703 1826 ...
##  $ state_l    : Factor w/ 51 levels "Alabama","Alaska",..: 5 5 48 5 22 20 5 3 38 34 ...
##  $ state_s    : Factor w/ 51 levels "AK","AL","AR",..: 5 5 47 5 20 22 5 4 38 28 ...
##  $ city       : Factor w/ 1352 levels "Acton","Ada",..: 355 355 37 930 737 46 1142 1103 979 1325 ...
##  $ metro      : Factor w/ 326 levels "","Adrian MI",..: 171 171 314 262 39 163 261 226 229 321 ...
##  $ growth     : num  158957 57348 55460 26043 20690 ...
##  $ revenue    : num  195640000 82640563 85076502 35293000 77652360 ...
##  $ industry   : Factor w/ 25 levels "Advertising & Marketing",..: 5 11 2 23 24 7 13 13 25 7 ...
##  $ yrs_on_list: int  2 1 1 1 1 2 2 1 1 1 ...

Explore factor variables and the different levels in State and Industry

##  [1] "Alabama"              "Alaska"               "Arizona"             
##  [4] "Arkansas"             "California"           "Colorado"            
##  [7] "Connecticut"          "Delaware"             "District of Columbia"
## [10] "Florida"              "Georgia"              "Hawaii"              
## [13] "Idaho"                "Illinois"             "Indiana"             
## [16] "Iowa"                 "Kansas"               "Kentucky"            
## [19] "Louisiana"            "Maine"                "Maryland"            
## [22] "Massachusetts"        "Michigan"             "Minnesota"           
## [25] "Mississippi"          "Missouri"             "Montana"             
## [28] "Nebraska"             "Nevada"               "New Hampshire"       
## [31] "New Jersey"           "New Mexico"           "New York"            
## [34] "North Carolina"       "North Dakota"         "Ohio"                
## [37] "Oklahoma"             "Oregon"               "Pennsylvania"        
## [40] "Puerto Rico"          "Rhode Island"         "South Carolina"      
## [43] "South Dakota"         "Tennessee"            "Texas"               
## [46] "Utah"                 "Vermont"              "Virginia"            
## [49] "Washington"           "West Virginia"        "Wisconsin"
##  [1] "Advertising & Marketing"      "Business Products & Services"
##  [3] "Computer Hardware"            "Construction"                
##  [5] "Consumer Products & Services" "Education"                   
##  [7] "Energy"                       "Engineering"                 
##  [9] "Environmental Services"       "Financial Services"          
## [11] "Food & Beverage"              "Government Services"         
## [13] "Health"                       "Human Resources"             
## [15] "Insurance"                    "IT Services"                 
## [17] "Logistics & Transportation"   "Manufacturing"               
## [19] "Media"                        "Real Estate"                 
## [21] "Retail"                       "Security"                    
## [23] "Software"                     "Telecommunications"          
## [25] "Travel & Hospitality"

Summary of the data set

##     row_num           id             rank         workers     
##  Min.   :   0   Min.   :    4   5000   :   1   Min.   :    0  
##  1st Qu.:1250   1st Qu.:19575   4999   :   1   1st Qu.:   24  
##  Median :2500   Median :23292   4998   :   1   Median :   50  
##  Mean   :2500   Mean   :20037   4997   :   1   Mean   :  209  
##  3rd Qu.:3749   3rd Qu.:25370   4996   :   1   3rd Qu.:  125  
##  Max.   :4999   Max.   :26620   4995   :   1   Max.   :34219  
##                                 (Other):4994                  
##            company                 url             state_l    
##  (add)ventures :   1   @properties   :   1   California: 694  
##  @Properties   :   1   110-consulting:   1   Texas     : 404  
##  110 Consulting:   1   123stores     :   1   New York  : 335  
##  123Stores     :   1   180           :   1   Florida   : 303  
##  180           :   1   180fusion     :   1   Virginia  : 284  
##  180Fusion     :   1   1seocom       :   1   Illinois  : 238  
##  (Other)       :4994   (Other)       :4994   (Other)   :2742  
##     state_s            city                metro          growth         
##  CA     : 694   New York : 178   New York City: 399   Min.   :    42.45  
##  TX     : 404   Chicago  :  95   Washington DC: 316   1st Qu.:    84.21  
##  NY     : 335   Atlanta  :  94   Los Angeles  : 274   Median :   151.72  
##  FL     : 303   Austin   :  87   Chicago      : 224   Mean   :   516.44  
##  VA     : 284   San Diego:  80   Atlanta      : 194   3rd Qu.:   347.65  
##  IL     : 238   Houston  :  76   Dallas       : 169   Max.   :158956.91  
##  (Other):2742   (Other)  :4390   (Other)      :3424                      
##     revenue                                   industry     yrs_on_list    
##  Min.   :   1953000   IT Services                 : 733   Min.   : 1.000  
##  1st Qu.:   4876791   Advertising & Marketing     : 453   1st Qu.: 1.000  
##  Median :  10722077   Business Products & Services: 435   Median : 2.000  
##  Mean   :  43058182   Health                      : 377   Mean   : 2.744  
##  3rd Qu.:  26952131   Software                    : 338   3rd Qu.: 4.000  
##  Max.   :5528202691   Financial Services          : 278   Max.   :12.000  
##                       (Other)                     :2386

Initial Observations from a summary of the data set

  • There are 5000 companies ranked from 1 to 5000 based on their percentage growth in 2014, from greatest rate of growth (ranked 1) to slowest rate of growth (ranked 5000).
    • Greatest rate of growth is 158956.91%, lowest is 42.45%
  • The 51 levels for state cover 49 states (Wyoming does not appear in the list), the District of Columbia, and one territory (Puerto Rico).
  • The minimum number of workers is 0 (need to explore further how it is possible to have no employees) with the maximum at 34219. Three quarters of the companies on the list have 125 or fewer employees (the third quartile).
  • The top 5 states with the greatest number of companies on the list are: California, Texas, New York, Florida, and Virginia. But the top 5 cities with the greatest number of companies on the list are: New York, Chicago, Atlanta, Austin, and San Diego. It may be worth figuring out why the top states and top cities don’t match.
  • The industries with the most companies on the list are: IT Services, Advertising & Marketing, Business Products & Services, Health, and Software.
  • For at least half of the companies, it is their first or second year on the list (the median is 2). About a quarter have been on the list 4 or more times, with 12 years being the highest number of years any one company has been on the list.

Univariate Plots Section

Histogram of states where companies are located

Histograms of workers by count

The first plot doesn’t have a small enough binwidth to show the trend. Reducing the binwidth gives a histogram that skews right. What happens to the distribution if I perform a log10 transformation?

Transforming the long tail by taking the log10 of workers helps better understand the distribution of workers. The transformed workers distribution looks close to a normal distribution with a longer tail on the right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      24      50     209     125   34220
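
A minimal sketch of the log10 transformation, using a handful of worker counts rather than the full column (the vector `workers_sample` and the bin edges are assumptions for illustration). Because the minimum number of workers is 0, log10 returns -Inf for those rows, so they need to be filtered out or handled before binning.

```r
# Hypothetical sample of worker counts; the real column is companies$workers.
workers_sample <- c(0, 11, 24, 50, 92, 125, 264, 34219)

# log10 compresses the long right tail, but log10(0) is -Inf.
log_workers <- log10(workers_sample)

# Keep only finite values before binning.
finite_log_workers <- log_workers[is.finite(log_workers)]

# Compute bin counts on the log scale without drawing a plot.
bins <- hist(finite_log_workers, breaks = seq(1, 5, by = 0.5), plot = FALSE)
print(bins$counts)
```

On the log scale each unit step is a tenfold change in head count, which is why the bulk of the companies (tens to hundreds of workers) spreads out while the few very large employers no longer dominate the axis.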

Distribution of industry

Distribution of revenue

## [1]    1953000 5528202691

Distribution of growth

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     42.45     84.21    151.70    516.40    347.70 159000.00

The distributions of growth and revenue look very similar. Let’s try another type of plot to tease apart how they differ. The frequency polygon plot better shows the different shapes of the distributions. Growth is calculated from revenue, so it is not surprising that the two distributions have a similar shape. A similar shape does not mean the two variables move together, though: the Pearson test in the bivariate section gives a correlation of only about 0.002 between 2014 revenue and percentage growth.

Many of the highest ranked companies are small businesses. This could be because smaller companies grow faster than big public companies. But it could also be that smaller companies are starting with smaller amounts of revenue. Absolute growth in dollars is different from percentage growth. For example, a company with no revenue the previous year that gains some revenue the next year has infinite percentage growth. But this isn’t a good reflection of how much revenue the company is generating compared to another company that’s making more in absolute revenue but has a lower percentage growth.

I created two new variables: revenue 2013, derived from current (2014) revenue and percentage growth, and growth in dollars, which is revenue 2013 subtracted from revenue 2014.

## [1] 123000 143853 153125 135000 373500 690697
## [1] 195517000  82496710  84923377  35158000  77278860 137286506
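
The derivation can be sketched as follows, assuming growth is the percentage increase of 2014 revenue over 2013 revenue (the formula is inferred from the description, and `companies_sample` is a hypothetical two-row stand-in for the full data frame, filled with the first two companies' values).

```r
# First two companies on the list; revenue and growth are taken from the data.
companies_sample <- data.frame(
  revenue = c(195640000, 82640563),   # 2014 revenue in dollars
  growth  = c(158956.91, 57347.92)    # percentage growth
)

# revenue2014 = revenue2013 * (1 + growth / 100), solved for revenue2013.
companies_sample$revenue2013 <- companies_sample$revenue /
  (1 + companies_sample$growth / 100)

# Absolute growth in dollars is the difference between the two years.
companies_sample$growth_dollar <- companies_sample$revenue -
  companies_sample$revenue2013

print(round(companies_sample[, c("revenue2013", "growth_dollar")]))
```

Dividing by `1 + growth / 100` rather than by `growth / 100` matters: a company that doubled its revenue has 100% growth, and its previous revenue is half the current figure, not equal to it.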

There is a limitation in my data set. Without data about resident populations in each state or city or metro area it is hard to determine whether the states with the highest number of growing companies have growing companies because there are more people living there or if there is something special about that state that fosters growth. Therefore, I looked for population data from the U.S. Census Bureau and found population estimates for 2010 to 2014. This works with the company data from 2014 with the reverse engineered revenue and growth numbers I calculated for 2013.

##   Geographic_Area Census_April1 Estimate_Base    Est_2010    Est_2011
## 1   United States   308,745,538   308,758,105 309,347,057 311,721,632
## 2       Northeast    55,317,240    55,318,348  55,381,690  55,635,670
## 3         Midwest    66,927,001    66,929,898  66,972,390  67,149,657
## 4           South   114,555,744   114,562,951 114,871,231 116,089,908
## 5            West    71,945,553    71,946,908  72,121,746  72,846,397
## 6         Alabama     4,779,736     4,780,127   4,785,822   4,801,695
##      Est_2012    Est_2013    Est_2014
## 1 314,112,078 316,497,531 318,857,056
## 2  55,832,038  56,028,220  56,152,333
## 3  67,331,458  67,567,871  67,745,108
## 4 117,346,322 118,522,802 119,771,934
## 5  73,602,260  74,378,638  75,187,681
## 6   4,817,484   4,833,996   4,849,377
## [1] "state_l"              "state_growth_dollar"  "state_population2014"
## 'data.frame':    51 obs. of  3 variables:
##  $ state_l             : Factor w/ 51 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ state_growth_dollar : num  718058962 4966627 2379822839 71079926 18309472149 ...
##  $ state_population2014: chr  "4,849,377" "736,732" "6,731,484" "2,966,369" ...

Univariate Analysis

What is the structure of your dataset?

I have two datasets. The original dataset is a list of the 5000 fastest growing private companies in 2014 in the U.S. from Inc. 5000. The second dataset I have is state population data from the Census Bureau. I have two resulting data frames: companies is the Inc. 5000 data set with new variables added, and state_growth is population data with additional variables.

What is/are the main feature(s) of interest in your dataset?

The variables most interesting to explore are the growth in percentage and dollar amounts, since the dataset from Inc. 5000 is specifically about the fastest growing private companies in the U.S. I am also very interested in the industry the companies are in.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Revenue will be an important way to understand growth. For example, a company with a small revenue will see greater gains in percentage growth than a company with a larger revenue, but the latter could have a much greater revenue and growth in absolute dollar amounts. So it is critical to interpret growth in light of revenue.

State population data is also important to better understand growth. A larger state might appear to have greater growth in absolute dollar amounts but that could be influenced by a greater population. Therefore investigating growth per capita can provide a fairer way to look at growth, especially from the point of view of smaller states.

Did you create any new variables from existing variables in the dataset?

I created 4 new variables from existing variables across two datasets. I created two new variables in the companies data frame: 1. revenue2013, 2. growth_dollar. I reverse engineered revenue from 2013 using revenue from 2014 and percentage growth. Then I subtracted the 2013 revenue from the 2014 revenue to get growth_dollar.

I also created a new dataframe using the state population data from the census. In this dataframe, I added two other variables: 3. state_growth_dollar and 4. growth_per_capita. state_growth_dollar was calculated by grouping companies by state and summing growth_dollar (the 2nd variable I created). The growth_per_capita variable was created by dividing state_growth_dollar by the state population.
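
A rough sketch of the state-level aggregation in base R. The company rows and their growth_dollar values below are made up for illustration; only the two population figures come from the census data. The original analysis attaches dplyr, so the actual code may well have used its grouping functions instead of `aggregate()` and `merge()`.

```r
# Hypothetical company-level rows; these growth_dollar values are invented.
companies_sample <- data.frame(
  state_l       = c("Alabama", "Alabama", "Alaska"),
  growth_dollar = c(1000000, 500000, 250000)
)

# 2014 population estimates for the two states, already converted to numeric.
population <- data.frame(
  state_l              = c("Alabama", "Alaska"),
  state_population2014 = c(4849377, 736732)
)

# Sum growth_dollar within each state.
state_growth <- aggregate(growth_dollar ~ state_l,
                          data = companies_sample, FUN = sum)
names(state_growth)[2] <- "state_growth_dollar"

# Join the totals to the population table on the state name, then divide.
state_growth <- merge(state_growth, population, by = "state_l")
state_growth$growth_per_capita <-
  state_growth$state_growth_dollar / state_growth$state_population2014

print(state_growth)
```

Joining on the long state name works only if both tables spell the names identically, so it is worth checking for states that drop out of the merge.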

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The revenue, growth, and workers histograms all skewed right with a very long tail. I had to perform a log transformation to better understand the data. I performed a lot of tidying and adjusting to import and join the two data frames, including converting the population data to numeric, because the commas separating the thousands were causing the read.csv() command to import population numbers as characters. I needed population numbers to be numeric so I could perform division to calculate growth_per_capita.
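
One way to do the character-to-numeric conversion, shown on the first four population strings from the state_growth data frame. This is a sketch of a common approach, not necessarily the exact code used in the analysis, and `population_numeric` is a name chosen for illustration.

```r
# Population values as read.csv() imported them: character strings with commas.
state_population2014 <- c("4,849,377", "736,732", "6,731,484", "2,966,369")

# Strip the thousands separators, then convert to numeric.
population_numeric <- as.numeric(gsub(",", "", state_population2014, fixed = TRUE))

print(population_numeric)
```

Calling `as.numeric()` directly on the comma-separated strings would return NA with a coercion warning, which is why the separators are removed first.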

geom_boxplot, geom_point, geom_violin, geom_jitter with geom_rug, geom_point(stat = 'summary'), geom_bin2d, geom_tile, geom_density2d, geom_point(alpha = 1/10, color = 'gray') + geom_line(stat = 'summary', fun.y = median), geom_point(alpha = 1/10, color = 'gray') + geom_step(stat = 'summary', fun.y = median)

Bivariate Plots Section

Which state has greatest revenue growth per capita in 2014?

## Warning in loop_apply(n, do.ply): Removed 22 rows containing missing values
## (geom_point).

## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Warning in loop_apply(n, do.ply): Removed 61 rows containing missing values
## (geom_point).

## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Warning in loop_apply(n, do.ply): Removed 469 rows containing missing
## values (geom_point).

## 'data.frame':    5000 obs. of  16 variables:
##  $ row_num          : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ id               : int  22890 25747 25643 26098 26182 22913 22937 25413 26079 25861 ...
##  $ rank             : Ord.factor w/ 5000 levels "5000"<"4999"<..: 5000 4999 4998 4997 4996 4995 4994 4993 4992 4991 ...
##  $ workers          : int  227 191 145 62 92 50 129 130 264 11 ...
##  $ company          : Factor w/ 5000 levels "(add)ventures",..: 1725 3569 3651 4211 79 3520 1094 3357 4703 1826 ...
##  $ url              : Factor w/ 5000 levels "@properties",..: 1725 3569 3647 4211 76 3520 1094 3357 4703 1826 ...
##  $ state_l          : Factor w/ 51 levels "Alabama","Alaska",..: 5 5 48 5 22 20 5 3 38 34 ...
##  $ state_s          : Factor w/ 51 levels "AK","AL","AR",..: 5 5 47 5 20 22 5 4 38 28 ...
##  $ city             : Factor w/ 1352 levels "Acton","Ada",..: 355 355 37 930 737 46 1142 1103 979 1325 ...
##  $ metro            : Factor w/ 326 levels "","Adrian MI",..: 171 171 314 262 39 163 261 226 229 321 ...
##  $ growth_percentage: num  158957 57348 55460 26043 20690 ...
##  $ revenue2014      : num  195640000 82640563 85076502 35293000 77652360 ...
##  $ industry         : Factor w/ 25 levels "Advertising & Marketing",..: 5 11 2 23 24 7 13 13 25 7 ...
##  $ yrs_on_list      : int  2 1 1 1 1 2 2 1 1 1 ...
##  $ revenue2013      : num  123000 143853 153125 135000 373500 ...
##  $ growth_dollar    : num  195517000 82496710 84923377 35158000 77278860 ...
## 'data.frame':    5000 obs. of  7 variables:
##  $ workers          : int  227 191 145 62 92 50 129 130 264 11 ...
##  $ growth_percentage: num  158957 57348 55460 26043 20690 ...
##  $ revenue2014      : num  195640000 82640563 85076502 35293000 77652360 ...
##  $ industry         : Factor w/ 25 levels "Advertising & Marketing",..: 5 11 2 23 24 7 13 13 25 7 ...
##  $ yrs_on_list      : int  2 1 1 1 1 2 2 1 1 1 ...
##  $ revenue2013      : num  123000 143853 153125 135000 373500 ...
##  $ growth_dollar    : num  195517000 82496710 84923377 35158000 77278860 ...
##   workers growth_percentage revenue2014                     industry
## 1     227         158956.91   195640000 Consumer Products & Services
## 2     191          57347.92    82640563              Food & Beverage
## 3     145          55460.16    85076502 Business Products & Services
## 4      62          26042.96    35293000                     Software
## 5      92          20690.46    77652360           Telecommunications
## 6      50          19876.52   137977203                       Energy
##   yrs_on_list revenue2013 growth_dollar
## 1           2      123000     195517000
## 2           1      143853      82496710
## 3           1      153125      84923377
## 4           1      135000      35158000
## 5           1      373500      77278860
## 6           2      690697     137286506

## Warning in loop_apply(n, do.ply): Stacking not well defined when ymin != 0

## 
##  Pearson's product-moment correlation
## 
## data:  companies$revenue2014 and companies$growth_percentage
## t = 0.1213, df = 4998, p-value = 0.9035
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02600509  0.02943333
## sample estimates:
##         cor 
## 0.001715438
## 
##  Pearson's product-moment correlation
## 
## data:  companies$revenue2014 and companies$growth_dollar
## t = 208.4471, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9440788 0.9498019
## sample estimates:
##       cor 
## 0.9470155

## 
##  Pearson's product-moment correlation
## 
## data:  companies$revenue2013 and companies$growth_percentage
## t = -2.016, df = 4998, p-value = 0.04386
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.056179170 -0.000785593
## sample estimates:
##         cor 
## -0.02850427
## 
##  Pearson's product-moment correlation
## 
## data:  companies$revenue2013 and companies$growth_dollar
## t = 84.3823, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7548497 0.7777247
## sample estimates:
##       cor 
## 0.7665302

## Warning in loop_apply(n, do.ply): Removed 64 rows containing missing values
## (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  companies$yrs_on_list and companies$revenue2014
## t = 10.7641, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1233182 0.1775016
## sample estimates:
##      cor 
## 0.150523
## 
##  Pearson's product-moment correlation
## 
## data:  companies$yrs_on_list and companies$revenue2013
## t = 12.0087, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1403959 0.1942809
## sample estimates:
##       cor 
## 0.1674635

## Warning in loop_apply(n, do.ply): Removed 37 rows containing non-finite
## values (stat_boxplot).

## Warning in loop_apply(n, do.ply): Removed 37 rows containing non-finite
## values (stat_ydensity).

## Warning in loop_apply(n, do.ply): Removed 124 rows containing missing
## values (geom_point).

## Warning in loop_apply(n, do.ply): Removed 31 rows containing missing values
## (geom_point).

## Warning in loop_apply(n, do.ply): Removed 64 rows containing missing values
## (geom_point).

## Warning in loop_apply(n, do.ply): Removed 21 rows containing missing values
## (geom_point).

## Warning in loop_apply(n, do.ply): Removed 69 rows containing missing values
## (geom_point).

## Warning in loop_apply(n, do.ply): Removed 469 rows containing missing
## values (geom_point).

## Warning in loop_apply(n, do.ply): Removed 65 rows containing missing values
## (geom_point).

## Warning in loop_apply(n, do.ply): Removed 362 rows containing missing
## values (geom_point).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

What was the strongest relationship you found?

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Plot Two

Description Two

Plot Three

Description Three


Reflection